

Connect Amazon EMR and RStudio on Amazon SageMaker

#artificialintelligence

RStudio on Amazon SageMaker is the industry's first fully managed RStudio Workbench integrated development environment (IDE) in the cloud. You can quickly launch the familiar RStudio IDE and scale the underlying compute resources up and down without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale. With tools like RStudio on SageMaker, users analyze, transform, and prepare large amounts of data as part of the data science and ML workflow. Data scientists and data engineers use Apache Spark, Hive, and Presto running on Amazon EMR for large-scale data processing. By using RStudio on SageMaker and Amazon EMR together, you can continue to use the RStudio IDE for analysis and development while offloading larger data processing jobs to managed Amazon EMR clusters.


Perform interactive data engineering and data science workflows from Amazon SageMaker Studio notebooks

#artificialintelligence

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). With a single click, data scientists and developers can quickly spin up Studio notebooks to explore and prepare datasets to build, train, and deploy ML models in a single pane of glass. We're excited to announce a new set of capabilities that enable interactive Spark-based data processing from Studio notebooks. Data scientists and data engineers can now visually browse, discover, and connect to Spark data processing environments running on Amazon EMR, right from their Studio notebooks, in a few clicks. After you're connected, you can interactively query, explore, and visualize data, and run Spark jobs to prepare data using the built-in SparkMagic notebook environments for Python and Scala.
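Under the hood, SparkMagic kernels typically reach a remote Spark cluster through an Apache Livy endpoint running on the EMR primary node (Livy's default port is 8998). As a rough sketch only, the endpoint URL below is a placeholder you would replace with your cluster's address, a sparkmagic `config.json` pointing at such an endpoint could look like this:

```json
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://<emr-primary-node>:8998",
    "auth": "None"
  },
  "kernel_scala_credentials": {
    "username": "",
    "password": "",
    "url": "http://<emr-primary-node>:8998",
    "auth": "None"
  }
}
```

In practice, the Studio EMR integration described here handles this connection for you; the fragment is just to show what the kernels are talking to.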


Perform interactive data processing using Spark in Amazon SageMaker Studio Notebooks

#artificialintelligence

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). With a single click, data scientists and developers can quickly spin up Studio notebooks to explore datasets and build models. You can now use Studio notebooks to securely connect to Amazon EMR clusters and prepare vast amounts of data for analysis and reporting, model training, or inference. You can apply this new capability in several ways. For example, data analysts may want to answer a business question by exploring and querying their data in Amazon EMR, viewing the results, and then either altering the initial query or drilling deeper into the results.


Using Distributed Machine Learning to Model Big Data Efficiently

#artificialintelligence

To use Spark, we can either run it on an AWS EMR cluster or, if you just want to try it out and play with it, run it in a local Jupyter notebook. There have been many great articles on how to set up a notebook on AWS EMR to use PySpark, such as this one. The EMR cluster configuration will also largely affect your runtime, which I cover in the last part. For preprocessing the data, I use Spark RDD manipulations to perform exploratory data analysis and visualization. The rest of the Spark preprocessing code and the Plotly visualization code can be found in the GitHub repo, but here are the graphs from our initial exploratory analysis.
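To make the RDD-style preprocessing concrete without needing a cluster, here is a small stand-in sketch: a toy in-memory class (my own illustration, not part of Spark) whose `map`/`filter`/`reduceByKey` chain mirrors the transformations you would write against a real PySpark RDD on EMR.

```python
from functools import reduce
from itertools import groupby


class LocalRDD:
    """Tiny in-memory stand-in for a Spark RDD, for illustration only."""

    def __init__(self, data):
        self._data = list(data)

    def map(self, f):
        # Apply f to every element, like RDD.map
        return LocalRDD(f(x) for x in self._data)

    def filter(self, f):
        # Keep elements where f is truthy, like RDD.filter
        return LocalRDD(x for x in self._data if f(x))

    def reduceByKey(self, f):
        # Merge values for each key with f, like RDD.reduceByKey
        keyed = sorted(self._data, key=lambda kv: kv[0])
        return LocalRDD(
            (k, reduce(f, (v for _, v in group)))
            for k, group in groupby(keyed, key=lambda kv: kv[0])
        )

    def collect(self):
        return list(self._data)


logs = ["INFO start", "ERROR disk", "INFO ok", "ERROR net"]
counts = (
    LocalRDD(logs)
    .map(lambda line: (line.split()[0], 1))   # (level, 1) pairs
    .reduceByKey(lambda a, b: a + b)          # count per level
    .collect()
)
# counts == [("ERROR", 2), ("INFO", 2)]
```

On a real EMR cluster the same pipeline would start from `sc.parallelize(logs)` (or a file read) and the work would be distributed across executors; the transformation chain itself reads the same.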


Build PMML-based Applications and Generate Predictions in AWS Amazon Web Services

#artificialintelligence

If you generate machine learning (ML) models, you know that a key challenge is exporting them from one framework and importing them into another, so that model generation and prediction can be separated. Many applications use PMML (Predictive Model Markup Language) to move ML models between frameworks. PMML is an XML representation of a data mining model. In this post, I show how to build a PMML application on AWS. First, you build a PMML model in Apache Spark using Amazon EMR.
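Since PMML is just XML, it helps to see what a model document looks like. Below is a minimal hand-rolled sketch of a PMML linear regression model (y = 2.0·x + 1.5) built with Python's standard library; the coefficients are made up for illustration, and in the workflow described here the document would instead be exported from the Spark model on EMR rather than written by hand.

```python
import xml.etree.ElementTree as ET

NS = "http://www.dmg.org/PMML-4_4"
ET.register_namespace("", NS)

# Root PMML element
pmml = ET.Element(f"{{{NS}}}PMML", version="4.4")
ET.SubElement(pmml, f"{{{NS}}}Header", description="toy linear model")

# Declare the fields the model uses
dd = ET.SubElement(pmml, f"{{{NS}}}DataDictionary", numberOfFields="2")
ET.SubElement(dd, f"{{{NS}}}DataField", name="x", optype="continuous", dataType="double")
ET.SubElement(dd, f"{{{NS}}}DataField", name="y", optype="continuous", dataType="double")

# A regression model: y = 2.0 * x + 1.5
model = ET.SubElement(pmml, f"{{{NS}}}RegressionModel",
                      functionName="regression", targetFieldName="y")
schema = ET.SubElement(model, f"{{{NS}}}MiningSchema")
ET.SubElement(schema, f"{{{NS}}}MiningField", name="x")
ET.SubElement(schema, f"{{{NS}}}MiningField", name="y", usageType="target")
table = ET.SubElement(model, f"{{{NS}}}RegressionTable", intercept="1.5")
ET.SubElement(table, f"{{{NS}}}NumericPredictor", name="x", coefficient="2.0")

xml_text = ET.tostring(pmml, encoding="unicode")
```

Any PMML-aware scoring engine can load a document like this and produce predictions without ever seeing the framework that trained the model, which is the portability the post relies on.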


Building a recommendation engine with AWS Data Pipeline, Elastic MapReduce and Spark

#artificialintelligence

From Google's advertisements to Amazon's product suggestions, recommendation engines are everywhere. As users of smart internet services, we've become accustomed to being shown things we like. This blog post is an overview of how we built a product recommendation engine for Hubba. I'll start with an explanation of the different types of recommenders and how we went about the selection process. Then I'll cover our AWS solution before diving into some implementation details. Content-based recommenders use discrete properties of an item, such as its tags.
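To make the content-based idea concrete, here is a minimal sketch that scores items by the overlap of their tag sets (Jaccard similarity). The catalog items and tags are invented for illustration and are not Hubba's actual data or method.

```python
def jaccard(a, b):
    """Jaccard similarity of two tag sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical catalog: item -> tags
catalog = {
    "organic-soap":   {"organic", "bath", "vegan"},
    "bamboo-brush":   {"eco", "bath"},
    "vegan-lip-balm": {"organic", "vegan", "cosmetics"},
}


def recommend(item, catalog, k=2):
    """Rank the other items by tag overlap with `item`."""
    scores = [(other, jaccard(catalog[item], tags))
              for other, tags in catalog.items() if other != item]
    return sorted(scores, key=lambda s: -s[1])[:k]


print(recommend("organic-soap", catalog))
# → [('vegan-lip-balm', 0.5), ('bamboo-brush', 0.25)]
```

Because the score depends only on item properties, a content-based recommender can suggest items no user has interacted with yet, which is one reason to weigh it against collaborative approaches during selection.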